High-Performance Tagging on Medical Texts

Authors

  • Udo Hahn
  • Joachim Wermter
Abstract

We ran both Brill's rule-based tagger and TNT, a statistical tagger, with a default German newspaper-language model on a medical text corpus. Supplied with limited lexicon resources, TNT outperforms the Brill tagger and reaches state-of-the-art performance (close to 97% accuracy). We then trained TNT on a large annotated medical text corpus, using a slightly extended tagset that captures certain particularities of medical language, and achieved 98% tagging accuracy. Hence, statistical off-the-shelf POS taggers can not only be reused immediately for medical NLP but, when trained on medical corpora, also achieve a higher performance level than on the newspaper genre.
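The accuracy figures above can be read as token-level tagging accuracy: the share of tokens whose predicted tag matches the gold annotation. The abstract does not spell out the metric, so this reading is an assumption, and the sketch below is only an illustration of it. The function name `tagging_accuracy`, the sample sentence, and the STTS-style tags are hypothetical and are not taken from the paper's corpus.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word form, POS tag)


def tagging_accuracy(gold: List[TaggedToken], predicted: List[TaggedToken]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Assumes both sequences cover the same tokens in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(
        gold_tag == pred_tag
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted)
    )
    return correct / len(gold)


# Hypothetical example sentence with STTS-style tags (illustration only).
gold = [("Der", "ART"), ("Patient", "NN"), ("hat", "VAFIN"), ("Fieber", "NN")]
predicted = [("Der", "ART"), ("Patient", "NN"), ("hat", "VAFIN"), ("Fieber", "NE")]

print(tagging_accuracy(gold, predicted))  # 3 of 4 tags match -> 0.75
```

On a real evaluation set the same ratio would be computed over all tokens of the held-out corpus rather than a single sentence.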


Similar resources

Part-of-Speech Tagging for Historical English

As more historical texts are digitized, there is interest in applying natural language processing tools to these archives. However, the performance of these tools is often unsatisfactory, due to language change and genre differences. Spelling normalization heuristics are the dominant solution for dealing with historical texts, but this approach fails to account for changes in usage and vocabula...


Performance and error analysis of three part of speech taggers on health texts

Increasingly, natural language processing (NLP) techniques are being developed and utilized in a variety of biomedical domains. Part of speech tagging is a critical step in many NLP applications. Currently, we are developing an NLP tool for text simplification. As part of this effort, we set out to evaluate several part of speech (POS) taggers. We selected 120 sentences (2375 tokens) from a corp...


BaseNP Supersense Tagging for Japanese Texts

This paper describes baseNP supersense tagging for Japanese texts. The task extracts base noun phrases (baseNPs) from raw texts in Japanese, and labels their baseNPs with supersenses. This task has a number of applications including predicate argument structure analysis and question answering. While the definition of baseNP in English is relatively clear, its definition in Japanese has not yet ...


Alignment Across Oriental and Indo-European Languages

The linguistic characteristics of Oriental languages and Indo-European languages are very different. A purely length-based algorithm cannot produce high performance when aligning texts. This paper investigates the effectiveness of a critical part-of-speech (POS) criterion for alignment under different search strategies and different text registers. Two metrics, recall and precis...


Verb Detection in Persian Corpus

A novel technique is introduced for verb and inflection detection in Persian texts. This recognition can be useful in the preprocessing phase of natural language processing (NLP) and text mining tasks such as part-of-speech (POS) tagging and sentence boundary detection (SBD) in Persian texts. Our technique employs structural information about the Persian verb for the first phase of this detection and then uses the...




Publication date: 2004